Abstract

Multitask clustering tries to improve the clustering performance of multiple tasks simultaneously by
taking their relationship into account. Most existing multitask clustering algorithms are generative
clustering methods, and none are formulated as convex optimization problems. In this paper, we
propose two convex Discriminative Multitask Clustering (DMTC) algorithms to address these problems.
Specifically, we first propose a Bayesian DMTC framework. Then, we propose two convex DMTC 
objectives within the framework. The first one, which can be seen as a technical combination of the 
convex multitask feature learning and the convex Multiclass Maximum Margin Clustering (M3C), aims 
to learn a shared feature representation. The second one, which can be seen as a combination of the 
convex multitask relationship learning and M3C, aims to learn the task relationship. The two objectives 
are solved in a uniform procedure by the efficient cutting-plane algorithm. Experimental results on a toy 
problem and two benchmark datasets demonstrate the effectiveness of the proposed algorithms. 

Index Terms 

Convex optimization, cutting-plane algorithm, discriminative clustering, unsupervised multitask learning

I. Introduction 

With the rapid development of information technology, massive amounts of unlabeled task-specific 
data are generated every day. Many tasks can be seen as self-contained, yet somewhat similar. Because 
labeling the data manually is time-consuming and expensive, we often resort to clustering algorithms for
mining the undiscovered knowledge in the data. 

In traditional data mining studies, we cluster each task independently. However, some tasks
have so little data that their data distributions cannot be covered well. Hence, it is natural to consider
clustering several unlabeled tasks together to improve the performance on each individual task.
However, although some tasks are similar, many tasks are mutually unrelated, dissimilar, or
even negatively related. Simply merging all tasks together for clustering might be harmful. Therefore, we need to
develop a Multitask Clustering (MTC) algorithm that 1) is powerful in clustering each individual
task and 2) can mine the task relationships automatically from the data so as to further improve
the clustering performance. To achieve this goal, we need to draw on two research areas:
Multitask Learning (MTL) and clustering.

Multitask Learning: MTL [1], also known as learning to learn [2], learns multiple (probably) related
tasks simultaneously for improving the generalization performance on each task. It can be reviewed in
three respects: 1) "what to learn", 2) "when to learn", and 3) "how to learn" [3].

"What to learn" asks what knowledge is shared across tasks [3 ]. In this respect, the MTL techniques can 
be categorized into two classes. The first class shares common feature or kernel representations, such
as sharing the hidden units of neural networks [1], [4], [5] or sharing a common representation within the
regularization framework [6]-[11]. The second class shares common model parameters, such as
placing a common prior across tasks within the hierarchical Bayesian framework [12]-[14] or learning the
differences of the task-specific models in Frobenius norms under the regularization framework [15]-[17].

"When to learn" asks in which situation the tasks can share. Specifically, many MTL algorithms assume 
that the tasks are mutually related, which is an ideal situation. In practice, there might be some outlier
tasks or tasks with negative correlation. Learning with these tasks results in negative transfer, i.e. worsened
performance. Hence, how to discover the task relationship is another key issue that is attracting more
and more attention |@], [17]-[20]. One method is to group tasks into several clusters, where the tasks
in different groups are regarded as unrelated |@], [18]-[20]. Another method is to learn the inter-task
covariance matrix of the Gaussian process prior [17].

"How to learn" asks how the optimization problem can reach a good solution (i.e. performance) 
in a reasonable time once the first two respects are specified. In respect of effectiveness, among the
aforementioned MTL methods, convexity of the optimization objective is always desired, since the
global optimum can be reached and the optimization is simplified. So far, several
convex MTL algorithms have been developed, and better performance was reported [8], [11], [17], [18],
[21]. In respect of efficiency, the alternating optimization method, which optimizes one parameter at a time
with the others fixed, is a common efficient method.

Summarizing the aforementioned, in the new MTC design, we should take the convexity and the task 
relationship mining as two important considerations. 

Clustering: Clustering is the process of partitioning a set of data observations into multiple clusters 
so that the observations within a cluster are similar, and the observations in different clusters are very 
dissimilar [22]. Since the early works on k-means, many clustering algorithms have been developed, such
as kernel k-means, spectral clustering [23], [24], hierarchical clustering, probabilistic-based clustering,
metric clustering, clustering nonnumerical data, clustering high dimensional data, clustering graph data, 
etc. 

Like supervised classification, clustering algorithms can be classified into two classes: generative
clustering and discriminative clustering. The generative clustering algorithms model $p(\mathbf{x}, y; \theta)$, where
$\mathbf{x}$ and $y$ denote the input and output of the learning system respectively and $\theta$ is the parameter.
The discriminative clustering algorithms only focus on modeling $p(y \mid \mathbf{x}; \theta)$. Many traditional clustering
algorithms fall into the class of generative clustering, such as k-means, Gaussian mixture model,
restricted Boltzmann machine, etc. However, when we only care about the predicted labels but not
the distribution of the observations, the generative clustering methods solve a more general
problem than the one we need to solve. Moreover, if we make a wrong model assumption on the underlying data
distribution, we may get a rather weak clustering result. This phenomenon has been observed in both
supervised classification [25] and clustering [26]. Due to the above problems, many discriminative
clustering methods have been developed [24], [26]-[35], such as spectral clustering [24], Maximum
Margin Clustering (MMC) [28]-[33], regularized information maximization [34], etc.

Summarizing the aforementioned, in the new MTC design, we should try to construct a discriminative
MTC algorithm rather than a generative one.

Multitask Clustering: Although supervised MTL has been studied extensively in the aforementioned
respects, unsupervised MTL, i.e. MTC [36], is far from fully explored. Only very recently has it
received more and more attention [36]-[46]. 1) In respect of "what to learn", in [36], Teh et al. proposed
to discover the clusters that can be shared via the hierarchical Dirichlet process. In [47], Kulis and Jordan
first revisited a regularized k-means algorithm in the view of the Dirichlet process and then extended
it to MTC by sharing the clusters of the observations across the tasks. In [37], Dai et al. extended the
information theoretic co-clustering algorithm to MTC by making the tasks share the same feature attribute
cluster, where they studied MTC in the transfer learning scenario, a special case of MTL that focuses
on the performance of one target task. In [38]-[42], [44], [46], the authors tried to learn shared feature
or kernel representations in different distance metrics, such as Bregman distance. 2) In respect of "when
to learn", in [39], [40], Zhang and Zhang proposed the pairwise task regularization and centralized task
regularization methods for discovering the task relationship. 3) However, in respect of "how to learn",
none of the MTC algorithms is formulated as a convex problem.

Moreover, most of the MTC algorithms belong to the class of generative clustering. To the best of our
knowledge, discriminative MTC has not been fully studied. Only in [43], [45] did the authors propose
spectral clustering based MTCs.

Contributions: In this paper, we propose a new Bayesian Discriminative MTC (DMTC) framework. 
We implement two DMTC objectives by specifying the framework with four assumptions. The objectives 
are formulated as difficult Mixed Integer Programming (MIP) problems. We relax the MIP problems
to two convex optimization problems. The first one, named convex Discriminative Multitask Feature
Clustering (DMTFC), can be seen as a technical combination of the convex supervised Multitask Feature
Learning (MTFL) [8] and the Support Vector Regression based Multiclass MMC (SVR-M3C) [33]. The
second one, named convex Discriminative Multitask Relationship Clustering (DMTRC), can be seen as
a technical combination of the convex Multitask Relationship Learning (MTRL) [17] and SVR-M3C.
These combinations are quite natural and yield the following advantages: 

1) In respect of "what to learn", DMTFC can learn a shared feature representation between tasks. 
DMTRC can minimize the model differences of the related tasks. Both algorithms, as discriminative 
clustering algorithms, try to find the optimal label pattern directly. Both of them work in Frobenius 
norms under the regularization framework. 

2) In respect of "when to learn", DMTRC can learn the task relationship automatically from the data 
by learning the inter-task covariance matrix. 

3) In respect of "how to learn", both algorithms are generated from the Bayesian framework. Both 
of them are formulated as convex optimization problems, and are solved in a uniform optimization 
procedure. A number of efficient SVM techniques are available for the problems. In this paper, we
employ the cutting-plane algorithm [48]-[50], which has achieved great success in SVM, to solve the
DMTCs efficiently.

Experimental comparison with 7 single task clustering algorithms and 3 state-of-the-art MTCs on the 
pendigits toy dataset, the multi-domain newsgroup dataset, and the multi-domain sentiment dataset 
demonstrates the effectiveness of the proposed DMTCs.

The remainder of the paper is organized as follows. In Section II, we briefly review two related
techniques: the convex MTL and the convex MMC. In Section III, we propose a Bayesian framework
for DMTC. In Sections IV and V, we present the convex DMTFC and DMTRC objectives respectively. In
Section VI, we solve DMTFC and DMTRC within a uniform optimization procedure. In Section VII, we
extend DMTC to nonlinear kernels. In Section VIII, we analyze the complexity theoretically. In Section
IX, we show the effectiveness of DMTC empirically. Finally, in Section X, we conclude this paper and
present some future work.

We first introduce some notations here. Bold small letters, e.g., $\mathbf{w}$ and $\boldsymbol{\alpha}$, indicate column vectors.
Bold capital letters, e.g., $\mathbf{W}$, $\mathbf{K}$, indicate matrices. Letters in calligraphic bold fonts, e.g., $\mathcal{A}$, $\mathcal{B}$, and $\mathbb{R}$,
indicate sets, where $\mathbb{R}^d$ denotes a $d$-dimensional real space. $\mathbf{1}_m$ ($\mathbf{0}_m$) is a vector with all $m$ entries being
1 (0). $\mathbf{I}_d$ is a $d \times d$ identity matrix. The operator $^T$ denotes the transpose. $\langle \mathbf{x}, \mathbf{y} \rangle$ defines the inner
product of $\mathbf{x}$ and $\mathbf{y}$. The operator $\|\cdot\|_m$ denotes the $m$-norm, where $m$ is a constant. The operator "tr(·)"
denotes the trace of a matrix. The abbreviation "s.t." is short for "subject to". $h(\alpha; \beta)$ denotes a function $h$
with parameters $\alpha$ and $\beta$. The symbol $\{\mathbf{W}_c\}_{c=1}^{C}$ is short for the set $\{\mathbf{W}_1, \ldots, \mathbf{W}_C\}$. Without confusion,
we may further write $\{\mathbf{W}_c\}_{c=1}^{C}$ as $\{\mathbf{W}_c\}_c$ in equations for simplicity.

II. Related Work 

Convex Multitask Learning: There are several convex MTL algorithms in the literature, such as [8],
[11], [17], [18], [21]. Due to the length limitation of the paper, we introduce two related convex MTL
algorithms as follows. In [8], Argyriou et al. proposed to minimize the empirical risk of all tasks with a
Frobenius norm penalty on the differences of the task-specific models, which is a non-convex optimization
problem. Then, they proved that the problem is equivalent to a convex optimization problem, Multitask
Feature Learning (MTFL). In [17], Zhang and Yeung first tried to learn the task covariance matrix of
the Gaussian process in the regularization framework by utilizing the relationship between the Gaussian
process and the regularization. Because the concave function with respect to the covariance matrix variable
makes the objective non-convex, they further replaced the concave function by two convex constraints,
which results in a convex optimization problem, named MTRL.

We found that MTFL and MTRL can be explained together in the Bayesian framework, which
contributes to our motivation for the Bayesian DMTC framework.

1 Best Paper Award of UAI-2010
Convex Maximum Margin Clustering: Among the many discriminative clustering algorithms,
MMC [28]-[33], which is an unsupervised extension of the Support Vector Machine (SVM), has received
much attention since 2005. The key idea of MMC is to find not only the maximum margin hyperplane
in the feature space but also the optimal label pattern, such that if an SVM is trained on the optimal
label pattern, the optimal label pattern will yield the largest margin among all possible label patterns
$\{\mathbf{y} \mid \mathbf{y} = \{y_j\}_{j=1}^{n}, \forall y_j\}$, where $n$ is the number of observations and $y_j$ denotes the possible class of
the $j$-th observation. The main difficulty of MMC lies in that it is originally formulated as a difficult
Mixed-Integer Programming (MIP) problem [28] due to the integer vector variable $\mathbf{y}$ in the objective of
MMC.

To overcome MIP, researchers either relaxed the objective to convex optimization problems [28], [29],
[32], [33] or reformulated it as non-convex ones [30], [31]. Because the convex relaxation methods achieve
better clustering results than non-convex ones in general, we pay particular attention to this kind.

Originally, in [28], Xu et al. proposed to reformulate MMC as a convex semi-definite programming
problem by relaxing $\mathbf{M} = \mathbf{y}\mathbf{y}^T$ to a continuous matrix. In [29], they further extended the binary-class
MMC to the multiclass scenario, which has a time complexity as high as $O(n^{6.5})$. Recently, in [33], Zhang
and Wu proposed to construct a convex hull [51] on $\{\mathbf{y}\}$, and further extended the binary-class algorithm
to the multiclass problem, i.e. SVR-M3C, which can be solved by an alternating method in time $O(n \log n)$.

We found that SVR-M3C and MTFL/MTRL can be combined quite naturally within the proposed 
DMTC framework, and a number of popular SVM techniques are available for solving the problem 
efficiently. Therefore, MMC contributes to the implementation of the proposed DMTC framework. 

Cluster Ensemble: The work most similar to MTC in machine learning and data mining is cluster
ensemble [52]-[60]. The cluster ensemble aims to combine multiple clusterings with a so-called consensus
function for enhancing the stability and accuracy of the base clusterings. The scenario in which each base
clusterer processes only a part of the observations is called the observation-distributed scenario [52],
[53] or crowdclustering [57], [59]. The main difference between MTC and crowdclustering is that
crowdclustering assumes that all parts of observations are sampled from the same underlying distribution
while MTC does not assume so. However, we have to note that several cluster ensemble techniques can be
adapted to MTC, such as [55], [57], [59], [60]. Still, to our knowledge, none of the cluster ensembles
is both convex and constructed on discriminative clusterings.

2 Best Student Paper Award of SDM-2011 
III. Bayesian Framework of Discriminative Multitask Clustering

Suppose there are $m$ clustering tasks. The $i$-th task consists of $n_i$ unlabeled observations $\{\mathbf{x}_j^i\}_{j=1}^{n_i}$,
$\mathbf{x}_j^i \in \mathbb{R}^d$. We cluster each task into the same number of classes, denoted as $C$ with $C \ge 2$. The prediction
function of the $c$-th class for the $i$-th task is defined as $f_c^i(\mathbf{x}^i) = \mathbf{w}_{i,c}^T \mathbf{x}^i$, where $\mathbf{w}_{i,c}$ is the parameter of
$f_c^i$. The observation $\mathbf{x}^i$ is assigned to the $c^*$-th class if $c^* = \arg\max_c f_c^i$ holds. Note that we omit the
bias term $b_{i,c}$ in $f_c^i$ for simplicity.

For a $C$ class clustering problem, the discriminative clustering algorithm models $p(y \mid \mathbf{x}; \{\mathbf{w}_c\}_{c=1}^{C})$,
where $y \in \{1, 2, \ldots, C\}$. We further extend $y$ to a $C$ dimensional indicator vector $\mathbf{y}$, i.e. $\mathbf{y} = [y_1, \ldots, y_C]$,
where the label vector $\mathbf{y}$ takes 1 for the $k$-th element and $-\frac{1}{C-1}$ for the others when $y = k$. For instance,
if $\mathbf{x}$ falls into the first class, then $\mathbf{y} = [1, -\frac{1}{C-1}, \ldots, -\frac{1}{C-1}]$. This coding method is a common strategy
in the multiclass problems, such as k-means. Note that $\mathbf{y}$ is a row vector. Here, a set $\mathcal{B}_y$ is defined for
all possible $\mathbf{y}$, i.e. $\mathcal{B}_y = \{[1, -\frac{1}{C-1}, \ldots, -\frac{1}{C-1}], [-\frac{1}{C-1}, 1, \ldots, -\frac{1}{C-1}], \ldots, [-\frac{1}{C-1}, \ldots, -\frac{1}{C-1}, 1]\}$.
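The coding scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the proposed method; the function names `indicator_row`, `label_set`, and `decode` are hypothetical, and classes are indexed from 0 here.

```python
def indicator_row(k, C):
    """Row vector with 1 at class k (0-based) and -1/(C-1) elsewhere."""
    off = -1.0 / (C - 1)
    return [1.0 if c == k else off for c in range(C)]


def label_set(C):
    """All C admissible indicator rows, i.e. the set B_y."""
    return [indicator_row(k, C) for k in range(C)]


def decode(row):
    """Recover the class index as the position of the largest entry."""
    return max(range(len(row)), key=lambda c: row[c])


B_y = label_set(3)  # three rows, one per class
```

Every row of this coding sums to zero, so a task whose classes are perfectly balanced has zero column means in its indicator matrix.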

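For an $m$-task MTC problem, we denote $\mathbf{W}_c = [\mathbf{w}_{1,c}, \ldots, \mathbf{w}_{m,c}]$, $\mathbf{X}^i = [\mathbf{x}_1^i, \ldots, \mathbf{x}_{n_i}^i]$, and $\mathbf{Y}^i =
[(\mathbf{y}_1^i)^T, \ldots, (\mathbf{y}_{n_i}^i)^T]^T$. We try to optimize $\{\mathbf{W}_c\}_{c=1}^{C}$ under the Bayesian framework. The maximum a
posteriori estimation of $\{\mathbf{W}_c\}_{c=1}^{C}$ is formulated as

$$
\max_{\{\mathbf{W}_c\}_c, \{\mathbf{Y}^i\}_i} p\left(\{\mathbf{W}_c\}_c, \{\mathbf{Y}^i\}_i \mid \{\mathbf{X}^i\}_i\right)
= \max_{\{\mathbf{W}_c\}_c, \{\mathbf{Y}^i\}_i} p\left(\{\mathbf{W}_c\}_c\right) p\left(\{\mathbf{Y}^i\}_i \mid \{\mathbf{X}^i\}_i, \{\mathbf{W}_c\}_c\right). \tag{1}
$$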

Eq. (1) contains two parts. The first part $p(\{\mathbf{W}_c\}_c)$ is a prior that defines the task relationship. The
second part is a discriminative clustering model that covers all tasks. How to specify the prior and the
discriminative model is the central problem.

Now, we make four probabilistic assumptions on problem (1) to balance the difficulty of solving
DMTC against the effectiveness of DMTC.

a) Class evenness assumption. We assume that the empirical label marginal distribution $p(y)$ in each
task is known and distributed evenly. This assumption has been adopted by many discriminative clustering
algorithms, such as the class balance constraint assumption in MMC [28], [33] and the maximal entropy
assumption [34]. We prefer the class balance constraint assumption in [33] since it can simplify the
mathematical form of (1) and is tunable. The constraint set $\mathcal{B}^i$ is defined as:

$$
\mathcal{B}^i = \left\{ \mathbf{Y}^i \,\middle|\, -l_{i,c} \le \frac{\mathbf{1}_{n_i}^T \mathbf{y}_c^i}{n_i} \le l_{i,c},\ \forall c = 1, \ldots, C;\ \ \mathbf{y}_j^i \in \mathcal{B}_y,\ \forall j = 1, \ldots, n_i \right\} \tag{2}
$$
where $\mathbf{y}_c^i = [y_{1,c}^i, \ldots, y_{n_i,c}^i]^T$ denotes the $c$-th column of $\mathbf{Y}^i$ and $\{\{l_{i,c}\}_{c=1}^{C}\}_{i=1}^{m}$ are user defined
parameters that control the class balance. The constraint $-l_{i,c} \le \mathbf{1}_{n_i}^T \mathbf{y}_c^i / n_i \le l_{i,c}$ specifies the class evenness of
the $c$-th class, while the constraint $\mathbf{y}_j^i \in \mathcal{B}_y$ requires that $\mathbf{Y}^i$ be a legal indicator matrix. This
constraint set means that the indicator matrices that violate the constraints have zero probability of appearing,
while the matrices that obey the constraints have an equal chance of appearing. As will be shown in the
experimental section, a correct class balance assumption is very important to the success of DMTC. It
not only helps DMTC detect a reasonable label pattern but also prevents interference from
outliers. If we know the class distribution, we can set $l_{i,c}$ to a value that is around $\mathbf{1}_{n_i}^T \mathbf{y}_c^{i*} / n_i$, where $\mathbf{y}_c^{i*}$
is the $c$-th column of the ground truth label matrix of the $i$-th task; otherwise, we can just set all $l_{i,c}$ to
the same empirical value.
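A membership test for the constraint set can be sketched as follows. This is only an illustration under an assumption: the class-balance constraint is read as the two-sided bound $|\mathbf{1}^T \mathbf{y}_c^i| / n_i \le l_{i,c}$, and the helper names `is_legal_row` and `in_balance_set` are hypothetical.

```python
def is_legal_row(row, tol=1e-9):
    """True if the row has exactly one entry equal to 1 and the rest equal to -1/(C-1)."""
    C = len(row)
    off = -1.0 / (C - 1)
    ones = sum(1 for v in row if abs(v - 1.0) <= tol)
    offs = sum(1 for v in row if abs(v - off) <= tol)
    return ones == 1 and offs == C - 1


def in_balance_set(Y, l, tol=1e-9):
    """Check both conditions: legal indicator rows and bounded column means.

    Y is an n-by-C list of rows; l[c] is the (assumed two-sided) balance bound of class c.
    """
    n, C = len(Y), len(Y[0])
    if not all(is_legal_row(r, tol) for r in Y):
        return False
    for c in range(C):
        column_mean = sum(r[c] for r in Y) / n
        if abs(column_mean) > l[c] + tol:
            return False
    return True
```

With this reading, a small bound such as `l = [0.1, 0.1, 0.1]` accepts a task with two observations per class and rejects a label matrix that puts every observation in one class.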

b) Gaussian process prior assumption. The prior defines what to share in MTC. In this paper, we
follow Zhang and Yeung's formulation [17, equation 2] for the Gaussian process prior:

$$
p\left(\{\mathbf{W}_c\}_c\right) = \prod_{c=1}^{C} \left( q(\mathbf{W}_c) \prod_{i=1}^{m} \mathcal{N}\left(\mathbf{w}_{i,c} \mid \mathbf{0}_d, \sigma_1^2 \mathbf{I}_d\right) \right) \tag{3}
$$

where $\mathcal{N}(\mathbf{A}, \mathbf{B})$ is a multivariate normal distribution with $\mathbf{A}$ and $\mathbf{B}$ as the mean and covariance matrix
respectively, and $q(\mathbf{W}_c)$ is a matrix-variate normal distribution that defines what to share between tasks.
As will be shown later, $\mathcal{N}(\mathbf{w}_{i,c} \mid \mathbf{0}_d, \sigma_1^2 \mathbf{I}_d)$ plays a regularization role on the task-specific model $\mathbf{w}_{i,c}$, $i =
1, \ldots, m$. Note that restricting all tasks to have the same covariance $\sigma_1^2 \mathbf{I}_d$ might be too tight. In practice,
we can use different covariances for different tasks.

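In this paper, we consider two kinds of $q(\mathbf{W}_c)$. The first kind defines a shared feature representation:

$$
q_f(\mathbf{W}_c) = \frac{\exp\left(-\frac{1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c\right)\right)}{(2\pi)^{md/2} |\mathbf{D}|^{d/2}} \tag{4}
$$

where the mark $f$ is short for feature, and $\mathbf{D}$ is a covariance matrix that models the relationships between
the features. The second kind follows Zhang and Yeung's formulation [17, equation 2], which defines
one type of relationship between the tasks:

$$
q_t(\mathbf{W}_c) = \frac{\exp\left(-\frac{1}{2}\mathrm{tr}\left(\mathbf{W}_c \boldsymbol{\Omega}^{-1} \mathbf{W}_c^T\right)\right)}{(2\pi)^{md/2} |\boldsymbol{\Omega}|^{m/2}} \tag{5}
$$

where the mark $t$ is short for task, and $\boldsymbol{\Omega}$ is the covariance matrix that models the relationships between
different task-specific models $\mathbf{w}_{i,c}$.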

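c) Task independence assumption. We assume that when $\{\mathbf{W}_c\}_c$ is sampled from the prior distribution,
the tasks are mutually independent:

$$
p\left(\{\mathbf{Y}^i\}_i \mid \{\mathbf{X}^i\}_i, \{\mathbf{W}_c\}_c\right) = \prod_{i=1}^{m} p\left(\mathbf{Y}^i \mid \mathbf{X}^i, \{\mathbf{w}_{i,c}\}_c\right)
= \prod_{i=1}^{m} \prod_{c=1}^{C} p\left(\mathbf{y}_c^i \mid \mathbf{X}^i, \mathbf{w}_{i,c}\right)
= \prod_{i=1}^{m} \prod_{j=1}^{n_i} \prod_{c=1}^{C} p\left(y_{j,c}^i \mid \mathbf{x}_j^i, \mathbf{w}_{i,c}\right). \tag{6}
$$

With this assumption, we can incorporate any advanced binary-class discriminative clustering algorithm
into $p(\mathbf{y}_c^i \mid \mathbf{X}^i, \mathbf{w}_{i,c})$ without modifying the clustering algorithm significantly.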

d) Gaussian assumption on the discriminative clustering model. We assume $p(y_{j,c}^i \mid \mathbf{x}_j^i, \mathbf{w}_{i,c})$ in (6) is
Gaussian:

$$
p\left(y_{j,c}^i \mid \mathbf{x}_j^i, \mathbf{w}_{i,c}\right) = \mathcal{N}\left(y_{j,c}^i \mid \mathbf{w}_{i,c}^T \mathbf{x}_j^i, \sigma_2^2\right). \tag{7}
$$

This assumption makes the discriminative clustering a regression problem rather than a classification problem,
which might not be the real case since $y_{j,c}^i \in \{-\frac{1}{C-1}, 1\}$ is a discrete variable. However, it is known
that even in the supervised classification problem, if we set problem (1) with a non-Gaussian likelihood,
the computations of predictions are analytically intractable [61, page 39]. Moreover, regression based
classifiers have been widely adopted, such as the least-squares SVM.
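For reference, the link between the Gaussian assumption and a squared loss follows from the standard form of the Gaussian density. The worked step below is a sketch in the notation of this section, where $\sigma_2$ is assumed to denote the standard deviation of the Gaussian in assumption d):

```latex
-\ln \mathcal{N}\left(y_{j,c}^i \mid \mathbf{w}_{i,c}^T \mathbf{x}_j^i, \sigma_2^2\right)
= \frac{1}{2\sigma_2^2}\left(y_{j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i\right)^2
+ \frac{1}{2}\ln\left(2\pi\sigma_2^2\right)
```

The second term does not depend on the model parameters, so maximizing this likelihood amounts to minimizing a squared regression error on the coded labels.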

IV. Convex Discriminative Multitask Feature Clustering 

In this section, we will introduce the convex objective function of the proposed DMTFC. 
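Substituting Eqs. (2)-(4), (6) and (7) into problem (1) and taking the negative logarithm of (1) can
derive the following objective function:

$$
\min_{\{\mathbf{Y}^i \in \mathcal{B}^i\}_{i=1}^{m}} \min_{\mathbf{D}} \min_{\{\mathbf{W}_c\}_{c=1}^{C}} \sum_{c=1}^{C} \left( \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left(y_{j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i\right)^2
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{W}_c\right) + \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c\right) + \frac{d}{2}\ln|\mathbf{D}| \right) \tag{8}
$$

where $\lambda_1$ and $\lambda_2$ are two tunable regularization parameters that are related to $\sigma_1$ and $\sigma_2$.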
Problem (8) is non-convex, because $\ln|\mathbf{D}|$ is a concave function and $\mathcal{B}^i$ is a set of integer matrices.

In this section, we will relax (8) to a convex optimization problem, i.e. a problem that is convex with respect
to both the objective function and the constraints [51].

In respect of the objective function, we replace $\ln|\mathbf{D}|$ by the following convex constraint set:

$$
\mathcal{D} = \left\{ \mathbf{D} \mid \mathbf{D} \in \mathbb{R}^{d \times d},\ \mathbf{D} \succeq 0,\ \mathrm{tr}(\mathbf{D}) = 1 \right\} \tag{9}
$$
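which results in the following MIP problem:

$$
\min_{\{\mathbf{Y}^i \in \mathcal{B}^i\}_{i=1}^{m}} \min_{\mathbf{D} \in \mathcal{D}} \min_{\{\mathbf{W}_c\}_{c=1}^{C}} \sum_{c=1}^{C} \left( \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c\right)
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{W}_c\right) + \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left(y_{j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i\right)^2 \right). \tag{10}
$$

We can see that problem (10) is quite similar to [8, Theorem 1] except that (10) is a regularized
multiclass problem with the label $\mathbf{Y}^i$ as an integer matrix variable.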

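In respect of the constraints, we will construct a convex hull [51] on $\mathcal{B}^i$ as [32], [33] did. Specifically,
fixing $\{\mathbf{Y}^i\}_i$ and $\mathbf{D}$, problem (10) is formulated as:

$$
\sum_{c=1}^{C} \left( \min_{\mathbf{W}_c} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left(y_{j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i\right)^2
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{W}_c\right) + \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c\right) \right) \tag{11}
$$

where the problems in the big brackets are mutually independent. According to the Karush-Kuhn-Tucker
conditions [62], the dual form of the problem in the big brackets of (11) can be written as: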

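$$
h_c\left(\boldsymbol{\alpha}_c; \mathbf{D}; \{\mathbf{y}_c^i\}_i\right) = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha_{j,c}^i y_{j,c}^i - \frac{1}{2} \boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c \tag{12}
$$

where $\boldsymbol{\alpha}_c = [\alpha_{1,c}^1, \ldots, \alpha_{n_m,c}^m]^T$ are the dual variables, $\tilde{\mathbf{K}} = \mathbf{K}_{MTF} + \frac{1}{2}\boldsymbol{\Lambda}$ with $\boldsymbol{\Lambda}$ as the diagonal matrix
whose diagonal element equals $n_i$ if the corresponding observation belongs to the $i$-th task, and $\mathbf{K}_{MTF}$
denotes the multitask kernel matrix for feature learning, which is defined as:

$$
K_{MTF}\left(\mathbf{x}_{j_1}^{i_1}, \mathbf{x}_{j_2}^{i_2}\right) = \left(\mathbf{x}_{j_1}^{i_1}\right)^T \mathbf{D} \left(\lambda_1 \mathbf{D} + \lambda_2 \mathbf{I}_d\right)^{-1} \mathbf{x}_{j_2}^{i_2}\, \mathbf{e}_{i_1}^T \mathbf{e}_{i_2} \tag{13}
$$

with $\mathbf{e}_i$ as the $i$-th column of $\mathbf{I}_m$. $\mathbf{W}_c$ is obtained as:

$$
\mathbf{W}_c = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha_{j,c}^i \mathbf{D} \left(\lambda_1 \mathbf{D} + \lambda_2 \mathbf{I}_d\right)^{-1} \mathbf{x}_j^i \mathbf{e}_i^T. \tag{14}
$$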
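The step from the per-class primal problem to its dual can be sketched as follows. This is a reconstruction under stated assumptions, not a quotation: the per-class primal is taken to be the squared loss weighted by $1/n_i$ plus the two trace regularizers, and the residuals $\varepsilon_{ij}$ and the matrix $\mathbf{M}$ are symbols introduced here only for illustration.

```latex
\begin{aligned}
L &= \sum_{i=1}^{m}\frac{1}{n_i}\sum_{j=1}^{n_i}\varepsilon_{ij}^2
   + \frac{1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{M} \mathbf{W}_c\right)
   + \sum_{i,j}\alpha_{j,c}^i\left(y_{j,c}^i - \mathbf{w}_{i,c}^T\mathbf{x}_j^i - \varepsilon_{ij}\right),
   \qquad \mathbf{M} = \lambda_1\mathbf{I}_d + \lambda_2\mathbf{D}^{-1}, \\
\frac{\partial L}{\partial \varepsilon_{ij}} = 0
  &\;\Rightarrow\; \varepsilon_{ij} = \frac{n_i}{2}\,\alpha_{j,c}^i, \\
\frac{\partial L}{\partial \mathbf{W}_c} = 0
  &\;\Rightarrow\; \mathbf{W}_c = \mathbf{M}^{-1}\sum_{i,j}\alpha_{j,c}^i\,\mathbf{x}_j^i\mathbf{e}_i^T
   = \sum_{i,j}\alpha_{j,c}^i\,\mathbf{D}\left(\lambda_1\mathbf{D}+\lambda_2\mathbf{I}_d\right)^{-1}\mathbf{x}_j^i\mathbf{e}_i^T, \\
L^{*} &= \sum_{i,j}\alpha_{j,c}^i y_{j,c}^i
   - \frac{1}{2}\boldsymbol{\alpha}_c^T\mathbf{K}_{MTF}\,\boldsymbol{\alpha}_c
   - \frac{1}{4}\boldsymbol{\alpha}_c^T\boldsymbol{\Lambda}\,\boldsymbol{\alpha}_c .
\end{aligned}
```

Collecting the last two terms gives $-\frac{1}{2}\boldsymbol{\alpha}_c^T(\mathbf{K}_{MTF} + \frac{1}{2}\boldsymbol{\Lambda})\boldsymbol{\alpha}_c$, which is consistent with a kernel matrix of the form $\mathbf{K}_{MTF} + \frac{1}{2}\boldsymbol{\Lambda}$. The identity $\mathbf{M}^{-1} = \mathbf{D}(\lambda_1\mathbf{D} + \lambda_2\mathbf{I}_d)^{-1}$ also shows that observations from different tasks do not interact in the feature-learning kernel once $\mathbf{D}$ is fixed.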

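Substituting (12) back into problem (11), we get an equivalent optimization problem of (11) as follows:

$$
\sum_{c=1}^{C} \max_{\boldsymbol{\alpha}_c} h_c\left(\boldsymbol{\alpha}_c; \mathbf{D}; \{\mathbf{y}_c^i\}_i\right) = \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \sum_{c=1}^{C} h_c\left(\boldsymbol{\alpha}_c; \mathbf{D}; \{\mathbf{y}_c^i\}_i\right)
\triangleq \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}^i\}_i\right). \tag{15}
$$

Substituting (15) back into (10), we get an equivalent optimization problem of (10) as:

$$
\min_{\{\mathbf{Y}^i \in \mathcal{B}^i\}_{i=1}^{m}} \min_{\mathbf{D} \in \mathcal{D}} \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}^i\}_i\right). \tag{16}
$$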
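According to the minimax theorem [63], the optimal objective value of problem (16) is an upper bound
of $\max_{\{\boldsymbol{\alpha}_c\}_c} \min_{\mathbf{D}} \min_{\{\mathbf{Y}^i\}_i} h_s(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}^i\}_i)$, which means that problem (16) can be relaxed to:

$$
\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\mathbf{D} \in \mathcal{D}} \left\{ \max_{\theta} -\theta \quad
\text{s.t. } \theta \ge -h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}_k^i\}_i\right),\ \forall i, k: \mathbf{Y}_k^i \in \mathcal{B}^i \right\}. \tag{17}
$$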

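Reformulating the problem in the braces of (17) as its dual, we get the following equivalent problem:

$$
\begin{aligned}
&\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\mathbf{D} \in \mathcal{D}} \min_{\{\mathbf{Y}^i \in \bar{\mathcal{B}}^i\}_{i=1}^{m}} h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}^i\}_i\right) \\
&= \min_{\{\mathbf{Y}^i \in \bar{\mathcal{B}}^i\}_{i=1}^{m}} \min_{\mathbf{D} \in \mathcal{D}} \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{D}; \{\mathbf{Y}^i\}_i\right) \\
&= \min_{\{\boldsymbol{\mu}^i \in \mathcal{M}^i\}_{i=1}^{m}} \min_{\mathbf{D} \in \mathcal{D}} \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \sum_{c,i,j,k} \alpha_{j,c}^i \mu_k^i y_{k,j,c}^i - \frac{1}{2} \sum_{c=1}^{C} \boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c
\end{aligned} \tag{18}
$$

where $y_{k,j,c}^i$ is the element of $\mathbf{Y}_k^i$ at the $j$-th row and $c$-th column, $\bar{\mathcal{B}}^i = \{\mathbf{Y}^i \mid \mathbf{Y}^i = \sum_{k: \mathbf{Y}_k^i \in \mathcal{B}^i} \mu_k^i \mathbf{Y}_k^i,\ \boldsymbol{\mu}^i \in \mathcal{M}^i\}$
with $\mathcal{M}^i$ defined as $\mathcal{M}^i = \{\boldsymbol{\mu}^i \mid 0 \le \mu_k^i \le 1,\ \sum_{k: \mathbf{Y}_k^i \in \mathcal{B}^i} \mu_k^i = 1\}$. $\bar{\mathcal{B}}^i$ is the convex hull of $\mathcal{B}^i$ [51, page
24], which is the tightest convex relaxation of $\mathcal{B}^i$.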
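The convex hull construction can be illustrated with a small sketch: a relaxed label matrix is a convex combination of legal indicator matrices, weighted by a vector on the probability simplex. The function name `convex_combination` and the toy matrices are hypothetical and serve only as an example.

```python
def convex_combination(mats, mu, tol=1e-12):
    """Return sum_k mu[k] * mats[k] for weights mu on the probability simplex."""
    if abs(sum(mu) - 1.0) > tol or any(w < 0.0 or w > 1.0 for w in mu):
        raise ValueError("mu must lie in [0, 1] and sum to one")
    n, C = len(mats[0]), len(mats[0][0])
    return [[sum(w * M[j][c] for w, M in zip(mu, mats)) for c in range(C)]
            for j in range(n)]


# Two legal indicator matrices for n = 2 observations and C = 2 classes.
Y1 = [[1.0, -1.0], [-1.0, 1.0]]
Y2 = [[-1.0, 1.0], [1.0, -1.0]]
Y_relaxed = convex_combination([Y1, Y2], [0.75, 0.25])
```

The entries of `Y_relaxed` are no longer restricted to the two coded values, which is exactly what turns the integer label variable into a continuous one.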

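Writing the objective function in (18) back in its primal form, we derive its equivalent problem:

$$
\min_{\{\boldsymbol{\mu}^i \in \mathcal{M}^i\}_{i=1}^{m}} \min_{\mathbf{D} \in \mathcal{D}} \min_{\{\mathbf{W}_c\}_{c=1}^{C}} \sum_{c=1}^{C} \left( \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c\right)
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c^T \mathbf{W}_c\right) + \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left( \sum_{k} \mu_k^i y_{k,j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i \right)^2 \right). \tag{19}
$$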

Theorem 1: Problem (19) is convex with respect to $\{\boldsymbol{\mu}^i\}_{i=1}^{m}$, $\{\mathbf{W}_c\}_{c=1}^{C}$, and $\mathbf{D}$.

Proof: Because $\{\mathcal{M}^i\}_{i=1}^{m}$, $\{\mathbb{R}^{d \times m}\}_{c=1}^{C}$ and $\mathcal{D}$ are all convex sets, their Cartesian product $\mathcal{M}^1 \times
\ldots \times \mathcal{M}^m \times \mathbb{R}^{d \times m} \times \ldots \times \mathbb{R}^{d \times m} \times \mathcal{D}$, i.e. the constraint, is also convex [51, page 38], where $n = \sum_i n_i$.
It is easy to see that the term $\mathrm{tr}(\mathbf{W}_c^T \mathbf{W}_c)$ and the squared loss term of the objective function are convex by verifying that
their Hessian matrices are positive semidefinite [51, page 71]. The term $\mathrm{tr}(\mathbf{W}_c^T \mathbf{D}^{-1} \mathbf{W}_c)$ has been proved to be
jointly convex in [8]. Because the summation operation preserves convexity, the objective function is convex.
Therefore, problem (19) is jointly convex with respect to all variables. ■

Summarizing the aforementioned, problem (19) is a convex relaxation of the original problem (8). It
has two equivalent forms (17) and (18). Problem (17) is the objective function of DMTFC.
V. Convex Discriminative Multitask Relationship Clustering

In this section, we will introduce the convex objective function of the proposed DMTRC. 
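Substituting Eqs. (2), (3), and (5)-(7) into problem (1) and taking the negative logarithm of (1) can
derive the following objective function:

$$
\min_{\{\mathbf{Y}^i \in \mathcal{B}^i\}_{i=1}^{m}} \min_{\boldsymbol{\Omega}} \min_{\{\mathbf{W}_c\}_{c=1}^{C}} \sum_{c=1}^{C} \left( \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left(y_{j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i\right)^2
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c \mathbf{W}_c^T\right) + \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c \boldsymbol{\Omega}^{-1} \mathbf{W}_c^T\right) + \frac{m}{2}\ln|\boldsymbol{\Omega}| \right) \tag{20}
$$

We can see that problem (20) seems quite similar to problem (8) except that $\mathbf{W}_c$ and $\mathbf{D}$ in (8) are
replaced by $\mathbf{W}_c^T$ and $\boldsymbol{\Omega}$ respectively. However, essentially, what they learn is quite different. We can
also observe that problem (20) seems quite similar to [17, equation 5] except that (20) is a multiclass
problem and $\mathbf{Y}^i$ is an integer matrix variable. However, this "slight" difference makes (20) a hard MIP
problem.

Observing that the factors that make problem (8) and problem (20) non-convex are the same, we can
relax (20) with a convex relaxation procedure similar to that of (8). Due to the length limitation of the paper, we
only report the main results.

The relaxed convex optimization problem of problem (20) is formulated formally as follows:

$$
\min_{\{\boldsymbol{\mu}^i \in \mathcal{M}^i\}_{i=1}^{m}} \min_{\boldsymbol{\Omega} \in \mathcal{A}} \min_{\{\mathbf{W}_c\}_{c=1}^{C}} \sum_{c=1}^{C} \left( \frac{\lambda_2}{2}\mathrm{tr}\left(\mathbf{W}_c \boldsymbol{\Omega}^{-1} \mathbf{W}_c^T\right)
+ \frac{\lambda_1}{2}\mathrm{tr}\left(\mathbf{W}_c \mathbf{W}_c^T\right) + \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} \left( \sum_{k} \mu_k^i y_{k,j,c}^i - \mathbf{w}_{i,c}^T \mathbf{x}_j^i \right)^2 \right) \tag{21}
$$

where $\mathcal{A}$ is a convex constraint set defined as:

$$
\mathcal{A} = \left\{ \boldsymbol{\Omega} \mid \boldsymbol{\Omega} \in \mathbb{R}^{m \times m},\ \boldsymbol{\Omega} \succeq 0,\ \mathrm{tr}(\boldsymbol{\Omega}) = 1 \right\}. \tag{22}
$$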



Theorem 2: Problem (21) is convex with respect to $\{\boldsymbol{\mu}^i\}_{i=1}^{m}$, $\{\mathbf{W}_c\}_{c=1}^{C}$, and $\boldsymbol{\Omega}$.
Proof: The proof is similar to the proof of Theorem 1.
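Problem (21) has two equivalent forms. The first one is written as:

$$
\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\boldsymbol{\Omega} \in \mathcal{A}} \left\{ \max_{\theta} -\theta \quad
\text{s.t. } \theta \ge -h_s\left(\{\boldsymbol{\alpha}_c\}_c; \boldsymbol{\Omega}; \{\mathbf{Y}_k^i\}_i\right),\ \forall i, k: \mathbf{Y}_k^i \in \mathcal{B}^i \right\} \tag{23}
$$

where

$$
h_s\left(\{\boldsymbol{\alpha}_c\}_c; \boldsymbol{\Omega}; \{\mathbf{Y}^i\}_i\right) \triangleq \sum_{c=1}^{C} h_c\left(\boldsymbol{\alpha}_c; \boldsymbol{\Omega}; \{\mathbf{y}_c^i\}_i\right)
= \sum_{c=1}^{C} \left( \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha_{j,c}^i y_{j,c}^i - \frac{1}{2}\boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c \right)
$$

where $\tilde{\mathbf{K}} = \mathbf{K}_{MTR} + \frac{1}{2}\boldsymbol{\Lambda}$, and $\mathbf{K}_{MTR}$ denotes the multitask kernel matrix for relationship learning
that is defined as [17]:

$$
K_{MTR}\left(\mathbf{x}_{j_1}^{i_1}, \mathbf{x}_{j_2}^{i_2}\right) = \mathbf{e}_{i_1}^T \boldsymbol{\Omega} \left(\lambda_1 \boldsymbol{\Omega} + \lambda_2 \mathbf{I}_m\right)^{-1} \mathbf{e}_{i_2} \left\langle \mathbf{x}_{j_1}^{i_1}, \mathbf{x}_{j_2}^{i_2} \right\rangle. \tag{24}
$$

We also obtain $\mathbf{W}_c$ as:

$$
\mathbf{W}_c = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha_{j,c}^i \mathbf{x}_j^i \mathbf{e}_i^T \boldsymbol{\Omega} \left(\lambda_1 \boldsymbol{\Omega} + \lambda_2 \mathbf{I}_m\right)^{-1}. \tag{25}
$$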
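The multitask kernel for relationship learning can be sketched for the smallest non-trivial case. The code below is an illustration only: it assumes $m = 2$ tasks so that the matrix inverse can be written explicitly, tasks are indexed from 0, and the helper names `inv2`, `matmul2`, `task_weights`, and `k_mtr` are hypothetical.

```python
def inv2(A):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(A, B):
    """Product of two 2x2 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def task_weights(omega, lam1, lam2):
    """Omega (lam1 * Omega + lam2 * I)^{-1} for a 2x2 task covariance Omega."""
    M = [[lam1 * omega[i][j] + (lam2 if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    return matmul2(omega, inv2(M))


def k_mtr(x1, i1, x2, i2, omega, lam1, lam2):
    """Kernel between observation x1 of task i1 and observation x2 of task i2."""
    G = task_weights(omega, lam1, lam2)
    return G[i1][i2] * sum(a * b for a, b in zip(x1, x2))
```

With a diagonal `omega` the cross-task kernel value is zero, so the tasks are treated as unrelated; a positive off-diagonal entry makes observations of different tasks similar, and a negative one makes them dissimilar.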

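The second equivalent form is written as:

$$
\begin{aligned}
&\min_{\{\mathbf{Y}^i \in \bar{\mathcal{B}}^i\}_{i=1}^{m}} \min_{\boldsymbol{\Omega} \in \mathcal{A}} \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} h_s\left(\{\boldsymbol{\alpha}_c\}_c; \boldsymbol{\Omega}; \{\mathbf{Y}^i\}_i\right) \\
&= \min_{\{\boldsymbol{\mu}^i \in \mathcal{M}^i\}_{i=1}^{m}} \min_{\boldsymbol{\Omega} \in \mathcal{A}} \max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \sum_{c,i,j,k} \alpha_{j,c}^i \mu_k^i y_{k,j,c}^i - \frac{1}{2} \sum_{c=1}^{C} \boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c.
\end{aligned} \tag{26}
$$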

Summarizing the aforementioned, problem (21) is a convex relaxation of the original problem (20). It
has two equivalent forms (23) and (26). Problem (23) is the objective function of DMTRC.



VI. Optimization Procedure 

In this section, we solve DMTFC (17) and DMTRC (23) in a uniform framework. This framework
utilizes the fact that there are only two differences between them: 1) the multitask kernel functions
are different, see Eqs. (13) and (24); 2) the convex sets $\mathcal{D}$ and $\mathcal{A}$ are different, see Eqs. (9) and (22). To
facilitate the mathematical representation, we write (17) and (23) as the following uniform objective:

$$
\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\mathbf{Z} \in \mathcal{Z}} \left\{ \max_{\theta} -\theta \quad
\text{s.t. } \theta \ge -h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{Z}; \{\mathbf{Y}_k^i\}_i\right),\ \forall i, k: \mathbf{Y}_k^i \in \mathcal{B}^i \right\} \tag{27}
$$

where $\mathbf{Z}$ stands for $\mathbf{D}$ in (17) or $\boldsymbol{\Omega}$ in (23), and $\mathcal{Z}$ stands for $\mathcal{D}$ in (17) or $\mathcal{A}$ in (23).

The solution framework is an alternating method. First, it decomposes the unsupervised problem (27)
into a series of supervised multiclass MTL problems by the cutting-plane algorithm [48] and the extended level
method (ELM) [49], [50], where the decomposition algorithm can be seen as a multitask extension of
the SVR-M3C algorithm [33]. Then, it solves each supervised multiclass MTL problem in an alternating
way. Note that the difference in the optimization procedure between DMTFC and DMTRC only appears
in the supervised learning in Section VI-C.

A. Optimizing (27) Via Cutting Plane Algorithm

Because the number of constraints of problem (27) is exponentially large with respect to $n$, directly
optimizing (27) is impossible when $n$ is relatively large. Hence, we adopt the cutting-plane algorithm [48]
to solve problem (27) approximately.

We present the key idea of the cutting-plane algorithm as follows. Generally, given a constrained 
optimization problem, the cutting plane algorithm alternates the following two steps until the objective 
value converges. The first step is to solve a reduced problem of the constrained problem, i.e. a problem 
that contains only a part of the constraints. The second step is to find the most violated constraint of 
the reduced problem, and add it to the constraint set so as to form a new reduced problem for the next 
iteration. It has been proved that the number of cutting-plane iterations is upper bounded by $O(1/\epsilon)$
[64], where $\epsilon$ is a user defined cutting-plane solution precision.
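The two alternating steps can be sketched on a toy problem. The example below is not the DMTC solver: it minimizes the maximum of many linear constraints over a one-dimensional grid, the reduced problem is solved by exhaustive search, and the function `cutting_plane` together with the tangent-line instance are hypothetical choices made only to show the control flow.

```python
def cutting_plane(constraints, grid, eps=1e-3, max_iter=100):
    """Minimize max_k g_k(x) over `grid`, adding one violated constraint per round."""
    working = [constraints[0]]  # reduced constraint set
    for _ in range(max_iter):
        # Step 1: solve the reduced problem that uses only the working constraints.
        x = min(grid, key=lambda v: max(g(v) for g in working))
        theta = max(g(x) for g in working)
        # Step 2: find the most violated constraint of the full problem at x.
        worst = max(constraints, key=lambda g: g(x))
        if worst(x) <= theta + eps:  # nothing is violated by more than eps
            return x, theta, len(working)
        working.append(worst)
    return x, theta, len(working)


def tangent(t):
    """Tangent line of (x - 1)^2 at the point t, used as one linear constraint."""
    return lambda x: 2.0 * (t - 1.0) * (x - t) + (t - 1.0) ** 2


# Toy instance: 201 tangent lines, whose upper envelope approximates (x - 1)^2.
constraints = [tangent(-3.0 + 0.03 * k) for k in range(201)]
grid = [-3.0 + 0.01 * k for k in range(601)]
x_opt, theta_opt, used = cutting_plane(constraints, grid)
```

On this instance the loop stops with `x_opt` close to 1 after activating only a small fraction of the 201 constraints, which is the behavior that makes the approach attractive when the full constraint set is exponentially large.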

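For problem (27), the cutting-plane algorithm iterates the following two steps:
a) Solving the following reduced problem of problem (27):

$$
\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\mathbf{Z} \in \mathcal{Z}} \left\{ \max_{\theta} -\theta \quad
\text{s.t. } \theta \ge -h_s\left(\{\boldsymbol{\alpha}_c\}_c; \mathbf{Z}; \{\mathbf{Y}_k^i\}_i\right),\ \forall i, k: \mathbf{Y}_k^i \in \mathcal{Y}^i \right\} \tag{28}
$$

where $\mathcal{Y}^i$ is the reduced constraint subset of $\mathcal{B}^i$. Problem (28) is equivalent to

$$
\max_{\{\boldsymbol{\alpha}_c\}_{c=1}^{C}} \min_{\mathbf{Z} \in \mathcal{Z}} \min_{\{\boldsymbol{\mu}^i \in \mathcal{M}_{|\mathcal{Y}^i|}\}_{i=1}^{m}} \sum_{c,i,j} \alpha_{j,c}^i \sum_{k=1}^{|\mathcal{Y}^i|} \mu_k^i y_{k,j,c}^i - \frac{1}{2} \sum_{c=1}^{C} \boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c \tag{29}
$$

where $|\mathcal{Y}^i|$ denotes the size of $\mathcal{Y}^i$ and $\mathcal{M}_{|\mathcal{Y}^i|} = \{\boldsymbol{\mu}^i \mid 0 \le \mu_k^i \le 1,\ \sum_{k=1}^{|\mathcal{Y}^i|} \mu_k^i = 1\}$. Here, we leave this
complicated problem to Section VI-B.

b) Calculating the most violated constraint $\mathbf{Y}^i_{|\mathcal{Y}^i|+1}$ by solving the following problem and adding
$\mathbf{Y}^i_{|\mathcal{Y}^i|+1}$ to $\mathcal{Y}^i$:

$$
\min_{\{\mathbf{Y}^i_{|\mathcal{Y}^i|+1} \in \mathcal{B}^i\}_{i=1}^{m}} \sum_{i=1}^{m} \sum_{c=1}^{C} \sum_{j=1}^{n_i} \alpha_{j,c}^i y^i_{|\mathcal{Y}^i|+1,j,c} - \frac{1}{2} \sum_{c=1}^{C} \boldsymbol{\alpha}_c^T \tilde{\mathbf{K}} \boldsymbol{\alpha}_c \tag{30}
$$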
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We can observe that the second item is irrelevant to the optimization problem, and the subitems 
S c Ylj a ) cVlyH+i j c m tne fi fSt ^ tem we mum ally independent with respect to i. Therefore, opti- 
mizing problem (l3Ql > is equivalent to optimizing the summation of the following problem: 

C n< 

Yl Yl a lcf\y\+i,j,c » Vi = 1, . . . , m. (31) 

|y*l+i c =l j=l 

Although the above problem is a binary integer matrix optimization problem, it can be solved in $O(\sum_{i=1}^{m} C n_i \log(C n_i))$ time thanks to the constraints on $Y^i$ in (2). See [33, Algorithm 6] for the efficient algorithm.

B. Optimizing (29) Via Extended Level Method

Problem (29) is a concave-convex optimization problem that is convex on $\mu$ and $Z$ and concave on $\alpha$. We will optimize it via the efficient Extended Level Method (ELM) [49], [50].

We present the key idea of the ELM algorithm as follows. ELM tries to solve the convex-concave optimization problem $\min_a \max_b f(a, b)$ that is convex on $a$ and concave on $b$ by iteratively constructing tighter upper and lower bounds for the optimal objective value $f(a^*; b^*)$, where $(a^*, b^*)$ denotes the optimal solution. Specifically, it iterates the following two steps. The first step is to construct the lower bound $\underline{f}_s = \min_a \max_{1 \le r \le s} f(a; b_r)$ and the upper bound $\bar{f}_s = \min_{1 \le r \le s} f(a_r; b_r)$ of $f(a^*; b^*)$, where $r$ and $s$ denote the indices of the iterations (i.e. solutions) and $\max_{1 \le r \le s} f(a; b_r)$ is also a cutting-plane model. The second step is to first get $a_{s+1}$ by solving the following optimization problem

$$\min_{a_{s+1}} \|a_{s+1} - a_s\|^2 \quad \text{s.t. } f(a_{s+1}; b_r) \le \tau \bar{f}_s + (1 - \tau) \underline{f}_s,\ \forall r = 1, \ldots, s, \qquad (32)$$

and then get $b_{s+1}$ by solving $\max_{b_{s+1}} f(a_{s+1}, b_{s+1})$, where $\tau$ is a user-defined constant. Eq. (32) acts like a regularizer that prevents $a_{s+1}$ from moving far from $a_s$.
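The bound construction and the projection step can be sketched on a one-dimensional toy problem. The Python code below is only an illustration under stated assumptions, not the solver used for problem (29): the saddle function `f`, its closed-form maximizer `best_response`, the search interval `[lo, hi]`, and the use of ternary search and bisection in place of a general convex solver are all choices made for this sketch.

```python
def f(a, b):
    """Toy saddle function (an assumption): convex in a, concave in b."""
    return (a - 1.0) ** 2 + 2.0 * a * b - b ** 2


def best_response(a):
    """argmax_b f(a, b); for this particular f the maximizer is b = a."""
    return a


def minimize_convex_1d(g, lo, hi, iters=200):
    """Ternary search for a minimizer of a convex function g on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if g(m1) <= g(m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


def level_method(a0, lo=-2.0, hi=2.0, tau=0.5, eps=1e-6, max_iter=200):
    """Level-method iteration for min_a max_b f(a, b) with scalar a."""
    a, cuts = a0, []
    best_a, upper, lower = a0, float("inf"), float("-inf")

    def model(x):
        # Cutting-plane model: max over the collected b_r of f(x, b_r).
        return max(f(x, b_r) for b_r in cuts)

    for _ in range(max_iter):
        b = best_response(a)
        cuts.append(b)
        if f(a, b) < upper:                    # upper bound: min_r f(a_r, b_r)
            best_a, upper = a, f(a, b)
        a_min = minimize_convex_1d(model, lo, hi)
        lower = model(a_min)                   # lower bound: min_a max_r f(a, b_r)
        if upper - lower <= eps:
            break
        level = tau * upper + (1.0 - tau) * lower
        # Project a onto the level set {x : model(x) <= level}. In one
        # dimension this set is an interval containing a_min, so bisection
        # between a_min (inside) and a (outside) finds the nearest point.
        if model(a) > level:
            inside, outside = a_min, a
            for _ in range(100):
                mid = 0.5 * (inside + outside)
                if model(mid) <= level:
                    inside = mid
                else:
                    outside = mid
            a = inside
    return best_a, upper, lower
```

For this toy function the saddle point is at $a = b = 0.5$ with value $0.5$, so the returned bounds should bracket $0.5$ when the loop terminates.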

For problem (29), because optimizing $\mu$ and $Z$ jointly is difficult, setting $a = \{\mu, Z\}$ is improper. We propose to set $a = \mu$ and optimize $Z$ and $\alpha$ jointly. It is easy to prove the correctness of this new optimization strategy; the proof is similar to that of [49, Theorem 1]. Another very important issue is that, to make the cutting-plane algorithm presented in Section VI-A converge, for problem (29) at the $S$-th cutting-plane iteration, we should inherit all previous $S - 1$ ELM models to initialize the upper and lower bounds of the problem; otherwise, the cutting-plane algorithm will fail. The proof is the same as that of [33, Theorem 3].

With the aforementioned two key points, the ELM algorithm for problem (29) is presented as follows.

Suppose we are to solve the $S$-th problem (29), and the $R$-th previous ELM model contains $T_R$ solutions, denoted as $\{(\alpha^R_r, Z^R_r, \{\mu^{i,R}_r\}_{i=1}^{m})\}_{r=1}^{T_R}$, where $R = 1, \ldots, S - 1$. Suppose the constraint set of the $i$-th task at the $R$-th cutting-plane iteration, denoted as $\mathcal{Y}^{i,R}$, contains $|\mathcal{Y}^{i,R}|$ constraints ($|\mathcal{Y}^{i,R}| \le R$).

Initialization of ELM. All previous $S - 1$ ELM models should be inherited by padding $\mu^{i,R}_r$ with $|\mathcal{Y}^{i,S}| - |\mathcal{Y}^{i,R}|$ zeros:

$$\mu^{i,R}_r \leftarrow \left[ (\mu^{i,R}_r)^T,\ \mathbf{0}^T_{|\mathcal{Y}^{i,S}| - |\mathcal{Y}^{i,R}|} \right]^T, \quad \forall i,\ \forall r = 1, \ldots, T_R.$$

Without loss of generality, we further denote all inherited ELM solutions as $\{(\alpha_r, Z_r, \mu^i_r)\}_{r=1}^{s} = \{\{(\alpha^R_r, Z^R_r, \mu^{i,R}_r)\}_{r=1}^{T_R}\}_{R=1}^{S-1}$, $\forall i$, where $s = \sum_{R=1}^{S-1} T_R$.
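The zero-padding used to inherit earlier ELM models can be sketched as follows. This is a minimal illustration for a single task; the function names `pad_mu` and `inherit_models` and the nested-list layout of the stored solutions are assumptions made for this sketch and are not taken from the original implementation.

```python
def pad_mu(mu, target_size):
    """Append zeros so that a weight vector matches the current constraint set."""
    if len(mu) > target_size:
        raise ValueError("weight vector is longer than the current constraint set")
    return list(mu) + [0.0] * (target_size - len(mu))


def inherit_models(previous_models, target_size):
    """Flatten the solutions of all earlier ELM models into one padded list.

    previous_models[R] holds the weight vectors found at cutting-plane
    iteration R for one task; every vector is padded to target_size, the
    number of constraints available at the current iteration.
    """
    return [pad_mu(mu, target_size)
            for model in previous_models
            for mu in model]
```

Because only zeros are appended, each padded vector still sums to one and therefore remains a valid point of the enlarged simplex.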

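ELM. The ELM for (29) iterates the following steps:

a) Constructing the lower bound $\underline{h}_s$ by $\underline{h}_s = \min_{\{\mu^i \in \mathcal{M}^i\}_{i=1}^{m}} \max_{1 \le r \le s} h_S(\alpha_r; Z_r; \{\sum_k \mu^i_k Y^i_k\}_{i=1}^{m})$ and the upper bound $\bar{h}_s$ by $\bar{h}_s = \min_{1 \le r \le s} h_S(\alpha_r; Z_r; \{\sum_k \mu^i_{r,k} Y^i_k\}_{i=1}^{m})$. With $Z_r$ fixed, the tasks are mutually independent, hence we can get the bounds for each task separately:

$$\underline{h}^i_s = \min_{\mu^i \in \mathcal{M}^i} \max_{1 \le r \le s} h_S\Big(\alpha_r; Z_r; \sum_{k=1}^{|\mathcal{Y}^i|} \mu^i_k Y^i_k\Big), \quad \forall i, \qquad (33)$$

$$\bar{h}^i_s = \min_{1 \le r \le s} h_S\Big(\alpha_r; Z_r; \sum_{k=1}^{|\mathcal{Y}^i|} \mu^i_{r,k} Y^i_k\Big), \quad \forall i. \qquad (34)$$

b) Get $\{\mu^i_{s+1}\}_{i=1}^{m}$ by

$$\min_{\{\mu^i_{s+1} \in \mathcal{M}^i\}_{i=1}^{m}} \|\mu_{s+1} - \mu_s\|^2 \quad \text{s.t. } h_S\Big(\alpha_r; Z_r; \Big\{\sum_k \mu^i_{s+1,k} Y^i_k\Big\}_{i=1}^{m}\Big) \le \tau \bar{h}_s + (1 - \tau) \underline{h}_s,\ \forall r = 1, \ldots, s. \qquad (35)$$

Similar to Step a), we replace problem (35) with the summation of the following problems

$$\min_{\mu^i_{s+1} \in \mathcal{M}^i} \|\mu^i_{s+1} - \mu^i_s\|^2 \quad \text{s.t. } h_S\Big(\alpha_r; Z_r; \sum_{k=1}^{|\mathcal{Y}^i|} \mu^i_{s+1,k} Y^i_k\Big) \le \tau \bar{h}^i_s + (1 - \tau) \underline{h}^i_s,\ \forall r = 1, \ldots, s. \qquad (36)$$

c) Get $(\alpha_{s+1}, Z_{s+1})$ by

$$\min_{Z_{s+1} \in \mathcal{Z}} \max_{\alpha_{s+1}} h_S\Big(\alpha_{s+1}; Z_{s+1}; \Big\{\sum_{k=1}^{|\mathcal{Y}^i|} \mu^i_{s+1,k} Y^i_k\Big\}_{i=1}^{m}\Big). \qquad (37)$$

Here, we leave problem (37) to Section VI-C.

3 In this subsection, we denote $\{\alpha_c\}_{c=1}^{C}$ as $\alpha$ for clarity.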



C. Optimizing (37) Via the Alternating Method

Problem (37) is a supervised multiclass MTL problem. We adopt an alternating method that is similar to [8] for it.

The method iterates the following two steps:

a) Given fixed $Z$, we aim to solve

$$\max_{\{\alpha_c\}_{c=1}^{C}} h_S(\{\alpha_c\}_{c=1}^{C}; Z; Y) = \sum_{c=1}^{C} \max_{\alpha_c} h_c(\alpha_c; Z; y_c) \qquad (38)$$

where $Y = [(Y^1)^T, \ldots, (Y^m)^T]^T = [y_1, \ldots, y_C]$ with $Y^i$ short for $\sum_k \mu^i_k Y^i_k$, $i = 1, \ldots, m$, and $y_c$ short for the $c$-th column of $Y$. When $Z$ is fixed, the subitems on the right side of equation (38) are mutually independent. Hence, we solve each item

$$\max_{\alpha_c} h_c(\alpha_c; Z; y_c), \quad \forall c = 1, \ldots, C \qquad (39)$$

independently, which is a supervised regression problem.

b) Given fixed $\{\alpha_c\}_{c=1}^{C}$, we aim to optimize

$$\min_{Z \in \mathcal{Z}} h_S(\{\alpha_c\}_{c=1}^{C}; Z; Y). \qquad (40)$$

For solving problems (39) and (40), DMTFC and DMTRC should be considered separately as follows:

Specifying (39) and (40) as a part of DMTFC: We replace $Z$ and $\mathcal{Z}$ by $D$ and $\mathcal{D}$ respectively in the equations. For (39), the multitask kernel $K$ should be specified by Eq. (13). The calculation of $K$ will be expensive when the dimension of the observation $d$ is large, since the time complexity of the matrix inversion in (13) is generally $O(d^3)$. For (40), we can get the closed-form solution of $D$ as

$$D = \frac{\left(\sum_{c=1}^{C} W_c W_c^T\right)^{1/2}}{\mathrm{tr}\left(\left(\sum_{c=1}^{C} W_c W_c^T\right)^{1/2}\right)},$$

where $W_c$ is defined in (14). The derivation is analogous to [8, Appendix 1].

Specifying (39) and (40) as a part of DMTRC: We replace $Z$ and $\mathcal{Z}$ by $\Omega$ and $\mathcal{A}$ respectively in the equations. For (39), $K$ should be specified by Eq. (24). The calculation of $K$ will be expensive when the task number $m$ is large, since the time complexity of the matrix inversion in (24) is generally also $O(m^3)$. For (40), we can get the closed-form solution of $\Omega$ as

$$\Omega = \frac{\left(\sum_{c=1}^{C} W_c^T W_c\right)^{1/2}}{\mathrm{tr}\left(\left(\sum_{c=1}^{C} W_c^T W_c\right)^{1/2}\right)},$$

where $W_c$ is defined in (25). The derivation is analogous to [17, equation 13].
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D. Overview of the Algorithm 

The original DMTFC problem is formulated as (8), which is a difficult non-convex MIP problem. We relax it to a convex optimization problem (9), which has two equivalent forms (11) and (18). Similarly, the original DMTRC problem is formulated as (20). We relax it to a convex optimization problem (21), which has two equivalent forms (23) and (26).

Observing that the relaxed DMTFC (11) and DMTRC (23) are quite similar, we propose a unified convex optimization objective (27), which can be solved in an alternating manner by combining several existing efficient algorithms [32], [33], [48]-[51]. The optimization procedure is summarized as Algorithm 1 in the supplementary material.

VII. Learning With Nonlinear Kernels 

To incorporate the nonlinear feature mapping into DMTFC and DMTRC, we only need to modify their multitask kernel representations. Specifically, for DMTFC, we only need to modify Eq. (13) to

$$K_{\mathrm{mtf}}(x^{i_1}_{j_1}, x^{i_2}_{j_2}) = e_{i_1}^T \phi(x^{i_1}_{j_1})^T D (\lambda_1 D + \lambda_2 I_d)^{-1} \phi(x^{i_2}_{j_2}) e_{i_2}$$

and modify Eq. (14) to

$$W_c = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha^i_{c,j} D (\lambda_1 D + \lambda_2 I_d)^{-1} \phi(x^i_j) e_i^T,$$

where $\phi(\cdot)$ is the kernel-induced feature mapping. Because $\phi(\cdot)$ might be high dimensional or even infinite dimensional, as with the Radial Basis Function (RBF) kernel, we cannot calculate its representation exactly. Instead, we can use kernel decomposition techniques, such as kernel principal component analysis or Cholesky decomposition, to get $\phi(\cdot)$ approximately and explicitly. Similarly, for DMTRC, we only need to modify Eq. (24) to

$$K_{\mathrm{mtr}}(x^{i_1}_{j_1}, x^{i_2}_{j_2}) = e_{i_1}^T \Omega (\lambda_1 \Omega + \lambda_2 I_m)^{-1} e_{i_2}\, \kappa(x^{i_1}_{j_1}, x^{i_2}_{j_2})$$

and modify Eq. (25) to

$$W_c = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \alpha^i_{c,j} \phi(x^i_j) e_i^T \Omega (\lambda_1 \Omega + \lambda_2 I_m)^{-1},$$

where $\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$. Because DMTRC can incorporate nonlinear kernels implicitly via the kernel function $\kappa$, while DMTFC needs to calculate the representation of the feature mapping $\phi(\cdot)$ explicitly with additional time and storage complexities of at least $O(n^2)$, DMTRC is more efficient than DMTFC in kernel learning.

VIII. Complexity Analysis 

Because Algorithm 1 can be seen as a technical combination of SVR-M3C [33], MTFL [8], and MTRL [17], where the outer two loops of Algorithm 1 are a multitask extension of SVR-M3C and the inner loop can be seen as a special case of the multiclass classification extensions of MTFL/MTRL, the overall time and storage complexities of Algorithm 1 are dominated by the most expensive algorithm among SVR-M3C and MTFL/MTRL. SVR-M3C has a time complexity of $O(n \log n)$ and a storage complexity of $O(n)$ [33]. It is also easy to observe that MTFL has a time complexity of $O(n^2 + d^3)$ and a storage complexity of $O(n^2)$, and that MTRL has a time complexity of $O(n^2 + m^3)$ and a storage complexity of $O(n^2)$. Hence, when Algorithm 1 is specialized to DMTFC, it is suitable for medium-scale, low-dimensional problems. When it is specialized to DMTRC, it is suitable for medium-scale problems with a small number of tasks. The main obstacle that keeps DMTFC and DMTRC from large-scale problems is the time-demanding kernel calculation and matrix inversion in (13) and (24). To overcome it, dimension reduction techniques, sparse MTL techniques, distributed cluster ensembles, and sparse kernel estimations might be helpful. This discussion is beyond the scope of this paper, and we leave it as future work.

IX. Experiments 

In this section, we will compare the proposed DMTFC and DMTRC algorithms with 10 clustering algorithms on the UCI pendigits toy dataset and two benchmark datasets: the multi-domain newsgroups dataset and the multi-domain sentiment dataset. All experiments are conducted with MATLAB 7.12 on a 2.27 GHz 8-core Intel(R) Xeon(R) server running Windows XP with 16 GB memory. The implementations of all algorithms and the supplementary material can be downloaded from the attachments.

The competitive algorithms can be categorized into two classes. The first class consists of the Single Task Clustering (STC) algorithms. They are 1) K-Means (KM), 2) Kernel K-Means (KKM) with the RBF kernel, 3) Normalized Cut (NC) [23] with the RBF kernel, 4) the Discriminative STC (DSTC) algorithm, 5) KM that groups all tasks into a single task (ALL KM), 6) ALL KKM, and 7) ALL NC, where DSTC is the single task version of our DMTRC. The DSTCs with the linear kernel and the RBF kernel are denoted as DSTC$_l$ and DSTC$_r$ respectively. The second class consists of the state-of-the-art MTC algorithms.
They are 1) Learning the Shared Subspace for MTC (LSSMTC) [38], 2) Learning a Spectral Kernel for 
MTC (LSKMTC) f4lH. and 3) Multitask Bregman Clustering with Pairwise task regularization (MBC- 
P) HDl . The experiments of the competitive algorithms are run exactly with the authors' experimental 
settings. 

For our DMTFC and DMTRC, $\lambda_1$ and $\lambda_2$ are both searched from $\{2^{-10}, 2^{-8}, \ldots, 2^{-2}\}$. We make
a strong assumption that we know the class distribution beforehand, so that Zj jC in Eq. © is set to 
li :C = l^.y**/nj where y** is the c-th column of the ground truth label matrix Y l of the i-th task. The 
DMTFC and DMTRC with the linear kernel are denoted as DMTFC$_l$ and DMTRC$_l$ respectively, and those with the RBF kernel are denoted as DMTFC$_r$ and DMTRC$_r$ respectively.

The kernel width of all algorithms that work with the RBF kernel is searched from $\{2^{-2}, 2^{-1}, 2^{0}, 2^{1}, 2^{2}\} \cdot A$, where $A$ is the average Euclidean distance of the data. The data are normalized into the range of [0,1] in each dimension. All computation time is recorded except that consumed on normalizing the dataset. The datasets used in the experiments are provided with labels. Therefore, the performance is evaluated by comparing the predicted labels with the ground truth labels using Normalized Mutual Information (NMI).
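As a concrete reference for the evaluation metric, the following Python sketch computes NMI from two label lists. The text does not state which normalization is used, so dividing the mutual information by the geometric mean of the two entropies is an assumption of this sketch, as are the function names `entropy` and `nmi` and the handling of the degenerate single-cluster case.

```python
import math
from collections import Counter


def entropy(counts, n):
    """Shannon entropy (natural log) of a partition given its cluster sizes."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(true_labels, pred_labels):
    """Normalized mutual information between two labelings of the same items.

    Assumed normalization: MI / sqrt(H(true) * H(pred)).
    """
    if len(true_labels) != len(pred_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(true_labels)
    true_counts = Counter(true_labels)
    pred_counts = Counter(pred_labels)
    joint_counts = Counter(zip(true_labels, pred_labels))
    mutual_info = sum(
        (c / n) * math.log(c * n / (true_counts[t] * pred_counts[p]))
        for (t, p), c in joint_counts.items()
    )
    h_true = entropy(true_counts, n)
    h_pred = entropy(pred_counts, n)
    if h_true == 0.0 or h_pred == 0.0:
        # Degenerate case (a single cluster): a convention chosen for this sketch.
        return 1.0 if h_true == h_pred else 0.0
    return mutual_info / math.sqrt(h_true * h_pred)
```

The score does not depend on how the clusters are named, so a prediction that recovers the true partition under permuted cluster indices still receives the maximum score.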

A. Results on Pendigits Dataset 

In this subsection, the pendigits dataset in the UCI machine learning repository is used as a toy dataset for capturing the main characteristics of the proposed DMTC algorithms. The pendigits dataset contains 10 handwritten integer digits ranging from 0 to 9. It consists of 11256 observations and 16 attributes. Each digit consists of about 1100 observations. Although the pendigits dataset is a single task clustering problem, we generate a multitask clustering problem from it: First, we take 0, 3, 6, 8, 9 as one group, and 1, 2, 4, 5, 7 as the other group. Then, we repeatedly sample 20 observations from each digit in the first group 3 times. We do the same for the second group. Because each repeat forms a 5-class clustering task that contains 100 observations, we obtain 6 tasks in total, where Tasks 1, 2 and 3 are examples from the first group and Tasks 4, 5, and 6 are examples from the second group. Because the data are too few to cover the distributions of the digits, we can regard Tasks 1, 2 and 3 as related but not identical, and likewise Tasks 4, 5, and 6. We also regard Tasks 1, 2 and 3 as unrelated to Tasks 4, 5, and 6. A visualized example of the data distributions associated with the six tasks is shown in Fig. 1. We run three jobs on the six tasks. Job 1 is to cluster Tasks 1, 2, and 3. Job 2 is to cluster Tasks 4, 5, and 6. Job 3 is to cluster Tasks 1-6 together. For each MTC job, we repeat the experiment 30 times. For each single repeat, we also repeat the referenced algorithms 50 times and report the average results. For DMTFC$_r$, KPCA is used for getting $\phi(x)$ explicitly. It retains the top 100 largest eigenvalues and their eigenvectors.
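The task construction described above can be sketched in a few lines of Python. The function name `make_tasks`, the dictionary layout `data_by_digit`, the fixed seed, and the choice to sample without replacement within each task are assumptions of this sketch; the text does not specify these details.

```python
import random


def make_tasks(data_by_digit,
               groups=((0, 3, 6, 8, 9), (1, 2, 4, 5, 7)),
               per_digit=20, repeats=3, seed=0):
    """Build 5-class clustering tasks by repeated sampling from each digit group.

    data_by_digit maps a digit to the list of its observations. Each task is
    a list of (observation, digit) pairs; the digit is kept only so that the
    clustering result can be evaluated afterwards.
    """
    rng = random.Random(seed)
    tasks = []
    for group in groups:
        for _ in range(repeats):
            task = []
            for digit in group:
                # Assumed: sampling without replacement inside one task.
                for x in rng.sample(data_by_digit[digit], per_digit):
                    task.append((x, digit))
            tasks.append(task)
    return tasks
```

With the default arguments this yields six tasks of 100 observations each, the first three drawn from the digits 0, 3, 6, 8, 9 and the last three from 1, 2, 4, 5, 7.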

Fig. 2 shows the NMI comparison over the three jobs. From the figure, we can observe the following interesting phenomena. First, except for DMTFC$_l$, the proposed DMTC algorithms achieve higher NMIs than the referenced methods. This demonstrates the effectiveness of the proposed MTC algorithms. Second, except for DMTRC$_r$, the NMIs of all algorithms in Job 3 are lower than those in Jobs 1 and 2. This phenomenon is particularly apparent in DMTFC$_l$. It shows that the unrelated tasks or the reverse distributions worsen the clustering performance significantly. It also shows that when the tasks are truly related, learning a powerful feature representation is better than minimizing the distances between the task-specific models, but when the tasks are unrelated, forcibly learning a shared feature representation is very harmful, while learning the task relationship can avoid the negative transfer remarkably well. To better explain this, we visualize $D$ and $\Omega$ in Figs. 3 and 4 respectively. For DMTFC, in Figs. 3a, 3b, 3d, 3e, and 3f, the relationships of the features have been learned successfully
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Fig. 1. Visualization of the tasks on the pendigits data. The true labels are indicated by different colors and different symbols. 
PCA is used to generate the figure. 

Legend: Job 1 (Clustering Tasks 1-3), Job 2 (Clustering Tasks 4-6), Job 3 (Clustering Tasks 1-6)
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Fig. 2. NMI comparison on the pendigits dataset. 



by DMTFC. But in Fig. 3c, DMTFC$_l$ fails to learn a common feature representation, i.e., most features are recognized as mutually independent. For DMTRC, in Fig. 4, we can observe that DMTRC captures the relationships of the tasks successfully in Jobs 1 and 2 as well as in Job 3, which accounts for the immunity of DMTRC to negative transfer. Note that this kind of study has been conducted in many supervised MTL works, but to our knowledge, this is the first work that captures the task relationship successfully in the unsupervised learning scenario. Third, the referenced MTCs do not achieve better NMIs than the STCs. One possible explanation is that the referenced MTCs suffer from local minima more seriously than the STCs.

The above experiment assumes that the class distributions are known with all parameters li iC setting 
to the ideal situation l^y**/nj = 0. In this paragraph, we will investigate how the class evenness 
assumption affects the performance by setting all $\{\{l_{i,c}\}_{c=1}^{C}\}_{i=1}^{m}$ to the same value that is selected from $\{0, 0.03, 0.1, 0.2, 0.3\}$. The results are shown in Fig. 5. From the figure, we can observe the following phenomena: 1) In all settings, DMTC can benefit from joint training of all tasks except DMTFC$_l$. 2)
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(a) Job 1 , linear kernel (b) Job 2, linear kernel (c) Job 3, linear kernel 




(d) Job 1 . RBF kernel (e) Job 2, RBF kernel (f) Job 3, RBF kernel 




Fig. 3. Visualization of the shared feature filter learned by DMTFC on the pendigits dataset (i.e. the learned covariance between 
the features, i.e. D). The more grey the grid is, the weaker the filter contributes to the new feature representation. 



(a) Job 1 , linear kernel (b) Job 2, linear kernel (c) Job 3, linear kernel 
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Fig. 4. Hinton diagram of the task relationship learned by DMTRC on the pendigits dataset (i.e. the learned covariance between 
the task-specific models, i.e. $\Omega$). The grid in green means the tasks are related. The grid in red means the tasks are reverse. The
bigger the grid is, the more positive/negative the relationship is. 



Setting the class balance parameters to a value of 0.03, which is slightly biased from the ideal situation, can achieve even better performance, which means that if we select $l$ properly around the ideal value, the performance is guaranteed. 3) DMTC is sensitive to $l$: if the parameter $l$ is set improperly, the performance will degrade dramatically. Hence, for DMTC's practical use, we should select $l$ carefully.

B. Results on Multi-Domain Newsgroups Dataset 

The 20-newsgroups dataset is a widely used benchmark dataset that is a collection of about 20000 messages collected from 20 different Usenet newsgroups, 1000 messages from each. After postprocessing, each message is a vector with 26214 dimensions. We define a three-class MTC job on the 20-newsgroups dataset in Table I. From the table, we can see that Tasks 1 and 2 are highly related, Tasks 1 to 5 are somewhat related, while Task 6 seems to be an outlier task. Based on the above task definition, we generate 4 MTC problems by randomly selecting 5%, 10%, 20%, and 40% of the data from each class, so as to observe
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Fig. 5. Clustering performance with respect to the class balance parameter / on the pendigits dataset. 

TABLE I

Task definition on the 20-newsgroups dataset.

ID        Names of the classes
Task 1    comp.sys.mac.hardware vs. rec.sport.hockey vs. sci.electronics
Task 2    comp.sys.ibm.pc.hardware vs. rec.sport.baseball vs. sci.crypt
Task 3    comp.windows.x vs. rec.autos vs. talk.politics.guns
Task 4    comp.os.ms-windows.misc vs. sci.med vs. talk.politics.mideast
Task 5    rec.motorcycles vs. sci.space vs. talk.politics.misc
Task 6    misc.forsale vs. alt.atheism vs. soc.religion.christian



how the amount of data influences the effectiveness of DMTC. Because most algorithms are quite inefficient on high-dimensional datasets, we use PCA to project the dataset to a 100-dimensional subspace. DMTC and DSTC only use the linear kernel. The DMTRC$_l$ and DSTC$_l$ without the PCA projection, which are denoted as *DMTRC$_l$ and *DSTC$_l$ respectively, will also be investigated.

Fig. 6 shows the NMI comparison. From the figure, we can observe the following experimental phenomena. First, the proposed convex discriminative clustering algorithms are clearly better than the referenced methods in the same experimental environment. Second, DMTRC$_l$ is much better than DSTC$_l$, which shows that the task relationship is learned successfully. Third, DMTFC$_l$ is slightly worse than DSTC$_l$, which means that we cannot learn a strong shared feature representation across the tasks. This phenomenon might be caused by the PCA projection, where much useful information for constructing the feature representation is lost; however, we cannot get its performance on the original dataset due to its inefficiency on high-dimensional data. Fourth, when the PCA projection is used to form the experimental
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■ 5% data "10% data "20% data "40% data 
Fig. 6. NMI comparison on the 20-newsgroups dataset. a% is short for "experiments running with a% data of the dataset."

(a) 5% data (b) 10% data (c) 20% data (d) 40% data

Fig. 7. Visualizations of $D$ of DMTFC$_l$ on the 20-newsgroups dataset.



environment, the performance of the clustering algorithms gets worse when more data is used. On the contrary, when PCA is not used, the performance of both *DSTC$_l$ and *DMTRC$_l$ gets better. This phenomenon tells us that when more data is available, the features should provide richer information so that the models can be complex enough to describe the more varied distributions. It also shows the power of DSTC and DMTRC on high-dimensional datasets. Moreover, it demonstrates that the power of the proposed discriminative clustering algorithms does not rely on predefined models for describing the data distribution, which is a clear advantage over generative clustering.

To show how well the feature representation is learned, we visualize $D$ of DMTFC$_l$ in Fig. 7. From the figure, we can see that most features are considered as mutually independent, which also accounts for the ineffectiveness of DMTFC$_l$.

To demonstrate how well the task relationship is learned, we show the Hinton diagrams of $\Omega$ of DMTRC$_l$ and *DMTRC$_l$ in Figs. 8 and 9 respectively. The figures show that both methods can learn the task relationships equally well for different percentages of data. They also show that the task relationship is different from what we have defined in Table I. As an example, Task 6 is originally designed as an outlier task, but it contributes to the performance positively. This phenomenon is worth further study.

Fig. 10 gives the CPU time comparison. From the figure, we can easily observe that although the proposed methods have higher absolute time, both the proposed algorithms and the referenced methods have a time complexity of $O(n^2)$ except KM, LSKMTC and MBC-P, which means that they are all unsuitable for large-scale problems.
Fig. 8. Hinton diagrams of $\Omega$ of DMTRC$_l$ on the 20-newsgroups dataset.



(a) 5% data (b) 10% data (c) 20% data (d) 40% data

Fig. 9. Hinton diagrams of $\Omega$ of *DMTRC$_l$ on the 20-newsgroups dataset.

The results on each individual task and the stability analysis are described in the supplementary 
materials. 

C. Results on Multi-Domain Sentiment Dataset 

The multi-domain sentiment dataset is a widely used benchmark dataset that was originally designed for MTL research purposes. It contains product reviews taken from Amazon.com from many product types (domains or tasks). For a convenient comparison with the supervised MTFL and MTRL, we adopt the same experimental setting as [17]. Specifically, the dataset in use is a postprocessed version$^4$ that aims to classify the reviews of some products into two classes: positive or negative reviews. It contains four binary-class tasks: books, DVDs, electronics, and kitchen appliances. Each task contains 2000 observations, in which 1000 reviews are labeled as positive and the other 1000 as negative. Each observation is a vector with 473853 dimensions$^5$. We generate 3 MTC problems by randomly selecting 10%, 30%, and 50% of the data from each task. Other experimental settings are the same as those on the 20-newsgroups dataset.

Fig. 11 gives the NMI comparison. The experimental phenomena are quite similar to those on the 20-newsgroups dataset. The only difference is that when more data is available and when PCA is used to project the high-dimensional dataset to a low-dimensional space, the clustering algorithms are generally



4 http://www.cs.jhu.edu/~mdredze/datasets/sentiment/processed_acl.tar.gz 
5 We discarded 3 features that contain unrecognized characters. 
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Fig. 10. CPU time comparison on the 20-newsgroups dataset. 



■ 10% data "30% data "50% data 




Fig. 11. NMI comparison on the sentiment dataset. a% is short for "experiments running with a% data of the dataset." 

getting better on the sentiment dataset while the algorithms are getting worse on the 20-newsgroups dataset. This might be caused by the differing difficulty of the datasets. That is to say, projecting the data to a 100-dimensional subspace is enough to capture the useful information in the sentiment dataset, while doing so is not enough for the 20-newsgroups dataset. To support this explanation, we visualize $D$ of DMTFC$_l$ in Fig. 12 and compare it with the visualizations of $D$ in Fig. 7. We can clearly see that the feature filters $D$ on the sentiment dataset are more effective than those on the 20-newsgroups dataset.

We provide the Hinton diagrams of $\Omega$ of DMTRC$_l$ and *DMTRC$_l$ in Figs. 13 and 14 without further explanation. We further provide the performance of the proposed clustering algorithms on the individual tasks in Fig. 15. The general experimental phenomena in Fig. 15 are consistent with those in Fig. 11 and are comparable with those yielded by the supervised counterparts of the proposed clustering algorithms, i.e. MTFL and MTRL (see [17, Section 4.3]).

X. Conclusions and Future Work 

In this paper, we have proposed a novel Bayesian DMTC framework. Within the framework, we 
have implemented two multiclass DMTC objectives by specifying the framework with four assumptions. 
The first one, named DMTFC, works under a Gaussian process prior that models a shared feature 
Fig. 12. Visualizations of $D$ of DMTFC$_l$ on the sentiment dataset.
Fig. 13. Hinton diagrams of $\Omega$ of DMTRC$_l$ on the sentiment dataset.

representation across tasks, while the second one, named DMTRC, works under a Gaussian process prior that models the task relationship. Both objectives are formulated as difficult MIP problems. We have further relaxed the MIP problems to convex optimization problems and solved the relaxed problems efficiently in a uniform alternating optimization procedure. Technically, the two convex DMTC algorithms can be seen as the objective combination of the supervised MTFL/MTRL and the unsupervised SVR-M3C. Experimental comparison with 7 STC algorithms as well as 3 state-of-the-art MTC algorithms on the pendigits, multi-domain newsgroups and multi-domain sentiment datasets demonstrated the effectiveness of the proposed algorithms. In the future, we are interested in lowering the complexities of the DMTCs and finding a new DMTC that can learn both a shared feature representation and the task relationship. We are also interested in clustering different tasks to different and "correct" clusters automatically with as little external prior knowledge as possible. Moreover, as shown in Fig. 5, the proposed DMTCs are sensitive to the class balance parameters; we will try to solve this problem by incorporating other clustering algorithms or cluster ensemble methods.
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